Statistical Depth for Ranking and Characterizing Transformer-Based Text Embeddings
The popularity of transformer-based text embeddings calls for better
statistical tools for measuring distributions of such embeddings. One such tool
would be a method for ranking texts within a corpus by centrality, i.e.
assigning each text a number signifying how representative that text is of the
corpus as a whole. However, an intrinsic center-outward ordering of
high-dimensional text representations is not trivial. A statistical depth is a
function for ranking k-dimensional objects by measuring centrality with respect
to some observed k-dimensional distribution. We adopt a statistical depth to
measure distributions of transformer-based text embeddings, which we term
transformer-based text embedding (TTE) depth, and introduce the practical use of this depth for
both modeling and distributional inference in NLP pipelines. We first define
TTE depth and an associated rank sum test for determining whether two corpora
differ significantly in embedding space. We then use TTE depth for the task of
in-context learning prompt selection, showing that this approach reliably
improves performance over statistical baseline approaches across six text
classification tasks. Finally, we use TTE depth and the associated rank sum
test to characterize the distributions of synthesized and human-generated
corpora, showing that five recent synthetic data augmentation processes cause a
measurable distributional shift away from associated human-generated text.
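As a rough illustration of how a depth-based ranking and the associated rank sum test could be wired together, the sketch below scores each embedding by its mean cosine similarity to a pooled corpus and compares two corpora with a Wilcoxon rank-sum test. The depth used here is an illustrative stand-in and may differ from the paper's exact TTE depth.

```python
# Illustrative sketch only: a simple cosine-similarity depth, not necessarily
# the exact TTE depth defined in the paper.
import numpy as np
from scipy.stats import ranksums

def depth_scores(queries: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Depth of each query embedding: mean cosine similarity to the reference corpus."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    r = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    return (q @ r.T).mean(axis=1)  # higher = more central

def corpora_differ(corpus_a: np.ndarray, corpus_b: np.ndarray, alpha: float = 0.05) -> bool:
    """Rank-sum test on the depths of the two corpora w.r.t. their pooled embeddings."""
    pooled = np.vstack([corpus_a, corpus_b])
    _, p_value = ranksums(depth_scores(corpus_a, pooled), depth_scores(corpus_b, pooled))
    return p_value < alpha
```

Ranking a corpus by depth_scores(corpus, corpus) yields the center-outward ordering described in the abstract; the most central texts can then serve, for example, as candidates for in-context learning prompts.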
Chain-of-Thought Embeddings for Stance Detection on Social Media
Stance detection on social media is challenging for Large Language Models
(LLMs), as emerging slang and colloquial language in online conversations often
contain deeply implicit stance labels. Chain-of-Thought (COT) prompting has
recently been shown to improve performance on stance detection tasks --
alleviating some of these issues. However, COT prompting still struggles with
implicit stance identification. This challenge arises because many samples are
difficult to interpret before a model becomes familiar with the slang and
evolving knowledge related to different topics, all of which must be acquired
from the training data. In this study, we address this problem
by introducing COT Embeddings which improve COT performance on stance detection
tasks by embedding COT reasonings and integrating them into a traditional
RoBERTa-based stance detection pipeline. Our analysis demonstrates that (1) text
encoders can leverage COT reasonings containing minor errors or hallucinations that
would otherwise distort the COT output label, and (2) text encoders can overlook
misleading COT reasoning when a sample's prediction heavily depends on
domain-specific patterns. Our model achieves SOTA performance on multiple
stance detection datasets collected from social media.
Comment: Accepted at EMNLP-2023, 8 pages.
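A minimal sketch of the COT Embeddings idea, assuming a sentence-transformers encoder and a logistic-regression head as stand-ins for the paper's RoBERTa-based pipeline: the LLM-generated chain-of-thought is embedded alongside the original post and the concatenation is fed to a stance classifier.

```python
# Sketch: embed the post and its chain-of-thought, concatenate, classify.
# Encoder choice and classifier head are assumptions, not the paper's setup.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-mpnet-base-v2")

def featurize(posts: list[str], cot_reasonings: list[str]) -> np.ndarray:
    post_emb = encoder.encode(posts)          # embeddings of the raw posts
    cot_emb = encoder.encode(cot_reasonings)  # embeddings of the LLM reasoning traces
    return np.concatenate([post_emb, cot_emb], axis=1)

def train_stance_classifier(posts, cot_reasonings, stance_labels):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(featurize(posts, cot_reasonings), stance_labels)
    return clf
```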
Scope of Pre-trained Language Models for Detecting Conflicting Health Information
An increasing number of people now rely on online platforms to meet their
health information needs. Thus, identifying inconsistent or conflicting textual
health information has become a safety-critical task. Health advice data poses
a unique challenge where information that is accurate in the context of one
diagnosis can be conflicting in the context of another. For example, people
suffering from diabetes and hypertension often receive conflicting health
advice on diet. This motivates the need for technologies which can provide
contextualized, user-specific health advice. A crucial step towards
contextualized advice is the ability to compare health advice statements and
detect if and how they are conflicting. This is the task of health conflict
detection (HCD). Given two pieces of health advice, the goal of HCD is to
detect and categorize the type of conflict. It is a challenging task, as (i)
automatically identifying and categorizing conflicts requires a deeper
understanding of the semantics of the text, and (ii) the amount of available
data is quite limited.
In this study, we are the first to explore HCD in the context of pre-trained
language models. We find that DeBERTa-v3 performs best with a mean F1 score of
0.68 across all experiments. We additionally investigate the challenges posed
by different conflict types and how synthetic data improves a model's
understanding of conflict-specific semantics. Finally, we highlight the
difficulty in collecting real health conflicts and propose a human-in-the-loop
synthetic data augmentation approach to expand existing HCD datasets. Our HCD
training dataset is over 2x bigger than the existing HCD dataset and is made
publicly available on Github
HealthE: Classifying Entities in Online Textual Health Advice
The processing of entities in natural language is essential to many medical
NLP systems. Unfortunately, existing datasets vastly under-represent the
entities required to model public health relevant texts such as health advice
often found on sites like WebMD. People rely on such information for personal
health management and clinically relevant decision making. In this work, we
release a new annotated dataset, HealthE, consisting of 6,756 health advice statements.
HealthE has a more granular label space compared to existing medical NER
corpora and contains annotation for diverse health phrases. Additionally, we
introduce a new health entity classification model, EP S-BERT, which leverages
textual context patterns in the classification of entity classes. EP S-BERT
provides a 4-point increase in F1 score over the nearest baseline and a
34-point increase in F1 when compared to off-the-shelf medical NER tools
trained to extract disease and medication mentions from clinical texts. All
code and data are publicly available on GitHub.
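The sketch below shows one way a context-aware entity classifier in the spirit of EP S-BERT could be framed: the candidate entity phrase is marked within its advice sentence and classified with a BERT-style model. The label space and the marking scheme are assumptions, not the paper's architecture.

```python
# Sketch: classify a candidate entity phrase using its advice-sentence context.
# ENTITY_CLASSES is an assumed label space; the model is assumed fine-tuned on HealthE.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ENTITY_CLASSES = ["condition", "medication", "food", "activity", "other"]  # assumed

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(ENTITY_CLASSES)
)

def classify_entity(advice_sentence: str, entity_phrase: str) -> str:
    # Mark the entity span so the encoder sees both the phrase and its context.
    marked = advice_sentence.replace(entity_phrase, f"[ {entity_phrase} ]", 1)
    inputs = tokenizer(marked, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return ENTITY_CLASSES[int(logits.argmax(dim=-1))]
```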
Theme-driven Keyphrase Extraction to Analyze Social Media Discourse
Social media platforms are vital resources for sharing self-reported health
experiences, offering rich data on various health topics. Despite advancements
in Natural Language Processing (NLP) enabling large-scale social media data
analysis, a gap remains in applying keyphrase extraction to health-related
content. Keyphrase extraction is used to identify salient concepts in social
media discourse without being constrained by predefined entity classes. This
paper introduces a theme-driven keyphrase extraction framework tailored for
social media, a pioneering approach designed to capture clinically relevant
keyphrases from user-generated health texts. Themes are defined as broad
categories determined by the objectives of the extraction task. We formulate
this novel task of theme-driven keyphrase extraction and demonstrate its
potential for efficiently mining social media text for the use case of
treatment for opioid use disorder. This paper leverages qualitative and
quantitative analysis to demonstrate the feasibility of extracting actionable
insights from social media data and efficiently extracting keyphrases using
minimally supervised NLP models. Our contributions include the development of a
novel data collection and curation framework for theme-driven keyphrase
extraction and the creation of MOUD-Keyphrase, the first dataset of its kind
comprising human-annotated keyphrases from a Reddit community. We also identify
the scope of minimally supervised NLP models to extract keyphrases from social
media data efficiently. Lastly, we find that a large language model (ChatGPT)
outperforms unsupervised keyphrase extraction models and evaluate its
efficacy on this task.
Comment: 11 pages, 2 figures, submitted to ICWSM. This version represents a
substantial expansion and refocus of the previous manuscript, including new
experiments, expanded data analysis, and comprehensive discussion.
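A minimal, minimally supervised sketch of theme-driven keyphrase extraction, assuming candidate phrases are scored by embedding similarity to a natural-language theme description; this is an illustrative baseline, not the paper's framework.

```python
# Sketch: rank candidate n-grams from a post by similarity to a theme description.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def theme_keyphrases(post: str, theme: str, top_k: int = 5) -> list[str]:
    # Candidate phrases: unigrams to trigrams extracted from the post.
    candidates = CountVectorizer(ngram_range=(1, 3), stop_words="english").fit([post]).get_feature_names_out()
    if len(candidates) == 0:
        return []
    theme_emb = encoder.encode([theme])
    cand_emb = encoder.encode(list(candidates))
    # Cosine similarity between each candidate phrase and the theme description.
    sims = (cand_emb @ theme_emb.T).squeeze(1) / (
        np.linalg.norm(cand_emb, axis=1) * np.linalg.norm(theme_emb) + 1e-9
    )
    return [candidates[i] for i in np.argsort(-sims)[:top_k]]

# Example (hypothetical theme): theme_keyphrases(reddit_post,
#     "barriers and side effects of medications for opioid use disorder")
```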
Text Encoders Lack Knowledge: Leveraging Generative LLMs for Domain-Specific Semantic Textual Similarity
Amidst the sharp rise in the evaluation of large language models (LLMs) on
various tasks, we find that semantic textual similarity (STS) has been
under-explored. In this study, we show that STS can be cast as a text
generation problem while maintaining strong performance on multiple STS
benchmarks. Additionally, we show generative LLMs significantly outperform
existing encoder-based STS models when characterizing the semantic similarity
between two texts with complex semantic relationships dependent on world
knowledge. We validate this claim by evaluating both generative LLMs and
existing encoder-based STS models on three newly collected STS challenge sets
which require world knowledge in the domains of Health, Politics, and Sports.
All newly collected data is sourced from social media content posted after May
2023 to ensure the performance of closed-source models like ChatGPT cannot be
credited to memorization. Our results show that generative LLMs
outperform the best encoder-only baselines by an average of 22.3% on STS tasks
requiring world knowledge. Our results suggest generative language models with
STS-specific prompting strategies achieve state-of-the-art performance on
complex, domain-specific STS tasks.
Comment: Under review at GEM@EMNLP-2023, 12 pages.
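The sketch below shows one way STS can be cast as text generation: prompt a generative LLM for a similarity rating and parse the number from its reply. The prompt wording and the generic llm callable are assumptions, not the paper's prompting strategy.

```python
# Sketch: semantic textual similarity via a generative LLM.
# `llm` is any callable mapping a prompt string to generated text (assumption).
import re
from typing import Callable

def sts_score(text_a: str, text_b: str, llm: Callable[[str], str]) -> float:
    prompt = (
        "On a scale from 0 (completely unrelated) to 5 (semantically equivalent), "
        "rate the semantic similarity of the two texts. Answer with a single number.\n"
        f"Text 1: {text_a}\nText 2: {text_b}\nScore:"
    )
    reply = llm(prompt)
    match = re.search(r"\d+(?:\.\d+)?", reply)  # pull the first number out of the reply
    return float(match.group()) if match else 0.0
```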
CitySpec with Shield: A Secure Intelligent Assistant for Requirement Formalization
An increasing number of monitoring systems have been developed in smart
cities to ensure that the real-time operations of a city satisfy safety and
performance requirements. However, many existing city requirements are written
in English with missing, inaccurate, or ambiguous information. There is a high
demand for assisting city policymakers in converting human-specified
requirements to machine-understandable formal specifications for monitoring
systems. To address this need, we build CitySpec, the first intelligent
assistant system for requirement specification in smart cities. To create
CitySpec, we first collect over 1,500 real-world city requirements across
different domains (e.g., transportation and energy) from over 100 cities and
extract city-specific knowledge to generate a dataset of city vocabulary with
3,061 words. We also build a translation model, enhance it through
requirement synthesis, and develop a novel online learning framework with
shielded validation. The evaluation results on real-world city requirements
show that CitySpec increases the sentence-level accuracy of requirement
specification from 59.02% to 86.64%, and has strong adaptability to a new city
and a new domain (e.g., the F1 score for requirements in Seattle increases from
77.6% to 93.75% with online learning). After the enhancement from the shield
function, CitySpec is now immune to most known textual adversarial inputs
(e.g., the attack success rate of DeepWordBug after the shield function is
reduced to 0% from 82.73%). We test CitySpec with 18 participants from
different domains. CitySpec shows strong usability and adaptability to
different domains, as well as robustness to malicious inputs.
Comment: arXiv admin note: substantial text overlap with arXiv:2206.0313
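A minimal sketch of shielded online learning in the spirit of CitySpec: a candidate (requirement, specification) pair is only used to update the model if a shield function accepts it. The model interface and the shield check are hypothetical placeholders, not CitySpec's actual components.

```python
# Sketch: only adapt the translation model when the shield validates the pair.
from typing import Callable, Optional

def shielded_online_update(
    requirement: str,
    model,                                # assumed to expose .predict() and .update()
    shield: Callable[[str, str], bool],   # True if the (requirement, spec) pair looks safe
) -> Optional[str]:
    spec = model.predict(requirement)     # candidate formal specification
    if shield(requirement, spec):
        model.update(requirement, spec)   # accept: adapt the translation model online
        return spec
    return None                           # reject: likely adversarial or malformed input
```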
The Scope of In-Context Learning for the Extraction of Medical Temporal Constraints
Medications often impose temporal constraints on everyday patient activity.
Violations of such medical temporal constraints (MTCs) lead to a lack of
treatment adherence, in addition to poor health outcomes and increased
healthcare expenses. These MTCs are found in drug usage guidelines (DUGs) in
both patient education materials and clinical texts. Computationally
representing MTCs in DUGs will advance patient-centric healthcare applications
by helping to define safe patient activity patterns. We define a novel taxonomy
of MTCs found in DUGs and develop a novel context-free grammar (CFG) based
model to computationally represent MTCs from unstructured DUGs. Additionally,
we release three new datasets with a combined total of N = 836 DUGs labeled
with normalized MTCs. We develop an in-context learning (ICL) solution for
automatically extracting and normalizing MTCs found in DUGs, achieving an
average F1 score of 0.62 across all datasets. Finally, we rigorously
investigate ICL model performance against a baseline model, across datasets and
MTC types, and through in-depth error analysis.
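To make the CFG idea concrete, the toy grammar below covers a handful of frequency- and event-anchored constraints such as "twice a day" or "after meals"; the productions are purely illustrative and are not the paper's grammar or MTC taxonomy.

```python
# Toy context-free grammar for a few medical temporal constraints (illustrative only).
import nltk

MTC_GRAMMAR = nltk.CFG.fromstring("""
    MTC   -> FREQ | FREQ PER | REL EVENT
    FREQ  -> NUM 'times' | 'once' | 'twice'
    PER   -> 'per' UNIT | 'a' UNIT
    UNIT  -> 'day' | 'week' | 'hour'
    REL   -> 'before' | 'after' | 'with'
    EVENT -> 'meals' | 'bedtime' | 'food'
    NUM   -> '3' | '4'
""")

parser = nltk.ChartParser(MTC_GRAMMAR)

def is_well_formed_mtc(tokens: list[str]) -> bool:
    """True if the token sequence parses under the toy grammar."""
    try:
        return any(True for _ in parser.parse(tokens))
    except ValueError:  # token outside the grammar's vocabulary
        return False

# is_well_formed_mtc(["twice", "a", "day"]) -> True
# is_well_formed_mtc(["after", "meals"])    -> True
```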
Characterizing Information Seeking Events in Health-Related Social Discourse
Social media sites have become a popular platform for individuals to seek and
share health information. Despite the progress in natural language processing
for social media mining, a gap remains in analyzing health-related texts within
social discourse in the context of events. Event-driven analysis can offer
insights into different facets of healthcare at an individual and collective
level, including treatment options, misconceptions, knowledge gaps, etc. This
paper presents a paradigm to characterize health-related information-seeking in
social discourse through the lens of events. Events here are broad categories
defined with domain experts that capture the trajectory of the
treatment/medication. To illustrate the value of this approach, we analyze
Reddit posts regarding medications for Opioid Use Disorder (OUD), a critical
global health concern. To the best of our knowledge, this is the first attempt
to define event categories for characterizing information-seeking in OUD social
discourse. Guided by domain experts, we develop TREAT-ISE, a novel multilabel
treatment information-seeking event dataset to analyze online discourse within an
event-based framework. This dataset contains Reddit posts on
information-seeking events related to recovery from OUD, where each post is
annotated based on the type of events. We also establish a strong performance
benchmark (77.4% F1 score) for the task by employing several machine learning
and deep learning classifiers. Finally, we thoroughly investigate the
performance and errors of ChatGPT on this task, providing valuable insights
into the LLM's capabilities and ongoing characterization efforts.
Comment: Under review, AAAI-2024. 10 pages, 6 tables, 2 figures.
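A minimal sketch of the multilabel event classification task, assuming a TF-IDF plus one-vs-rest logistic regression baseline and illustrative event names; the paper benchmarks several machine learning and deep learning classifiers rather than this exact pipeline.

```python
# Sketch: multilabel classification of information-seeking events in Reddit posts.
# EVENTS is an assumed label set, not the paper's event taxonomy.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

EVENTS = ["taking_medication", "tapering", "relapse", "side_effects"]  # assumed

def train_event_classifier(posts: list[str], event_labels: list[list[str]]):
    binarizer = MultiLabelBinarizer(classes=EVENTS)
    y = binarizer.fit_transform(event_labels)  # multilabel indicator matrix
    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),
        OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    )
    clf.fit(posts, y)
    return clf, binarizer
```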